Dataset analysis: beyjin movie dataset

Tasks:

  1. Set up the environment, download the data, get a general picture
  2. Obtain a list of the 10 directors who produced the most movies, in descending order.
  3. Free data mining and analysis of choice.

Allotted time: 2-3h.

Notebook requirements

Check if all listed python modules are installed (else pip install MODULE).
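A minimal sketch of such a check (the module list is illustrative, not the notebook's actual list):

```python
import importlib.util
import subprocess
import sys

# Modules this notebook relies on (illustrative list)
REQUIRED = ["pandas", "sqlalchemy", "matplotlib"]

def ensure_installed(modules):
    """pip-install any module that cannot be found by the import machinery."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    for module in missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", module])
    return missing
```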

Notebook was created using Python 3.8.5, jupyter-notebook 6.1.4; check current versions:

This notebook will download the beyjin dataset csv files and create a database file in the current directory.

Task 1: Data collection and processing

Set up SQLite local database

Using SQLite because it is lightweight, serverless, and keeps the whole database in a single local file, which fits a self-contained notebook well.

If user management, scalability or network access is required, consider MariaDB (a community-developed fork of MySQL, created after MySQL's acquisition by Oracle).

Interface with database using sqlalchemy module. Create database file moviedb.sql in current directory.
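A sketch of the database setup, roughly as described above:

```python
from sqlalchemy import create_engine, text

# SQLite database file in the current directory
engine = create_engine("sqlite:///moviedb.sql")

# Quick sanity check: the connection works and SQL executes
with engine.connect() as conn:
    assert conn.execute(text("SELECT 1")).scalar() == 1
```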

Download dataset from kaggle

Note: kaggle api token (kaggle.json) required in home folder (see https://www.kaggle.com/docs/api).

Import all csv files as pandas DataFrames
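A sketch of the import step, assuming the csv files sit in the current directory (the file names are whatever kaggle delivered):

```python
from pathlib import Path

import pandas as pd

# Read every csv in the current directory into a dict of DataFrames,
# keyed by file stem (e.g. "Movie_Movies.csv" -> "Movie_Movies")
dfs = {p.stem: pd.read_csv(p) for p in Path(".").glob("*.csv")}
```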

DtypeWarning: [...] have mixed types: this is expected for inconsistent data and will be addressed downstream.

Store initial data in database as a reference point

In Jupyter, cells working on the same data are often executed out of order, so it makes sense to replace database tables if they already exist. In production, if_exists="replace" should be removed.

Let's check the import using SQL:
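A sketch of the store-and-verify round trip, using a tiny synthetic DataFrame and an in-memory engine for illustration (in the notebook, the moviedb.sql engine and the real DataFrames would be used):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # in the notebook: sqlite:///moviedb.sql

df = pd.DataFrame({"Title": ["A", "B"], "Year": [1999, 2004]})

# Replace the table if it already exists -- convenient for out-of-order
# cell execution; drop if_exists="replace" in production.
df.to_sql("Movie_Movies", engine, if_exists="replace", index=False)

# Check the import using SQL
check = pd.read_sql("SELECT COUNT(*) AS n FROM Movie_Movies", engine)
```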

Task 1, continued: Data quality assessment

Let's have a look at the Movie_Movies DataFrame:

First-glance notes:

  1. NaN values are spotted in all columns, so the data might be sparse.
  2. In the last rows, column "Awards" shows various ways of denoting nominations. Countries appear both as comma-separated lists and as single values. Data consistency might be an issue.

A closer manual look may yield more artifacts, but these two issues fit the project scope well, and dealing with them systematically may uncover additional issues.

Rigorous cleaning of the dataset should only be performed after more specific goals have been set and time has been allotted.

How sparse are the datasets?

More specifically, how much of each column is empty? This will give an overall picture and also identify columns which carry more or less analytical, predictive or prescriptive value.

Let's obtain, for each DataFrame, the frequency of NaN values per column, and store these Series in a dict.
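A sketch of this step, with small synthetic stand-ins for the imported DataFrames:

```python
import pandas as pd

# dfs: dict of DataFrames as imported above; synthetic stand-ins here
dfs = {
    "Movie_Movies": pd.DataFrame({"Title": ["A", None, "C"], "Year": [1999, 2004, None]}),
    "Movie_Actors": pd.DataFrame({"Name": ["X", "Y"]}),
}

# Fraction of NaN values per column, one Series per DataFrame
nan_freqs = {name: df.isna().mean() for name, df in dfs.items()}
```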

We see that NaN values are only found in the Movie_Movies dataset - the rest have no empty cells at all.

Let's get a picture of which columns of Movie_Movies carry the least actual information:

We see that this part of the data set is fairly inhomogeneous:

Column Awards appears problematic: because of the sparseness of the data, we cannot be sure whether a row has no Awards entry because the movie received no awards, or because that particular data point is missing. In other words, entries that are present are probably reliable, but an empty entry does not necessarily mean "no awards": the column likely has high specificity, but may have low sensitivity.

Depending on the analysis goals, a decision will need to be made about how to treat these missing values.

Specific preparation for task 2: Director column

With regard to task 2 (top 10 directors), we see that 27,334 movies have no director specified. Let's have a look at some of them, sampling the list randomly:
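A sketch of the sampling step, with a tiny synthetic stand-in for the real DataFrame:

```python
import pandas as pd

# Synthetic stand-in for the Movie_Movies DataFrame
movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Director": ["X", None, None, "Y"],
})

no_director = movies[movies["Director"].isna()]
print(len(no_director))        # number of movies without a director
print(no_director.sample(2))   # re-run for a different random sample
```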

Sampling it a couple of times, it's probably safe to say that most of these should not be included in the director rating, since they seem to consist of productions like

and therefore are outliers with respect to the intended domain of the director rating.

How consistent is the data?

Before systematically analyzing the data, we should make sure that the data types throughout the DataFrames are consistent.

Ideally, for some data, we prefer a specific data type (e.g., dates should be of type datetime), so we should convert wherever convenient.

Let's find the type of all entries, for all DataFrames, and then have Pandas list the unique types per column, storing the results in a dictionary. Ideally, we'd like 1 type per column.
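A sketch of the type inspection on a small synthetic column (including a Year interval of the kind discussed below):

```python
import pandas as pd

df = pd.DataFrame({"Year": [1999, "2003–2005", 2004.0], "Title": ["A", "B", "C"]})

# Unique Python types per column; ideally each set has exactly one element
types_per_column = {col: set(df[col].map(type)) for col in df.columns}
```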

All DataFrames except Movie_Movies are consistent. In Movie_Movies, however, every column except imdbRating contains multiple types. Year in particular requires attention, showing three different types, but the other columns need work too.

It is possible that the float values are only caused by the NaN entries, as NaN in Pandas comes from numpy, where its type is, in fact, float (this can be mitigated by fillna); however, we should still check systematically.

The function defined below, inspect_df_types, generates a long text output, performing a type introspection of Movie_Movies. To improve readability, this output is attached at the end of the notebook, but its discussion follows below the inspect_df_types definition.

inspect_df_types output discussion

Given a broader project scope, inspect_df_types should be modified to produce a more succinct, less verbose output; for the current, narrower scope, however, it is useful and clear enough to quickly identify a number of key issues by looking over the output.

Type inconsistencies

  1. As suspected, all float types are NaN values.
  2. In the Year column, years are denoted as strings, ints, floats, and even strings showing intervals, like '2003–2005'. Ideally, we'd like to cast all of them to datetime, but before we do, we need to address the interval values, and that will depend on the exact analysis. Options include casting intervals to pd.Timedelta, skipping them, using the lower or upper bound (perhaps for long production cycles), or creating a row per each year in that interval (perhaps for running series).
  3. Runtime values are strings that we'd like as some sort of time format, perhaps pd.Timedelta.
  3. imdbVotes values are strings containing digit-group separator commas. These should be converted to ints.
  5. (low prio) Released, DVD strings should be converted to datetime.
  6. (low prio) Rated, Country and Type contain strings, but ideally we'd like to convert these to pandas Categoricals. This should improve performance, both computational and potentially predictive, and makes sense for these columns because the number of distinct labels is relatively low.
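A sketch of two of these conversions: taking the lower bound for Year intervals (one of the options listed above) and stripping the digit-group commas from imdbVotes. The helper names are illustrative.

```python
import re

def parse_year(value):
    """Return an int year; for intervals like '2003–2005', take the lower bound."""
    if value is None:
        return None
    m = re.match(r"(\d{4})", str(value))  # first 4-digit run also handles intervals
    return int(m.group(1)) if m else None

def parse_votes(value):
    """Convert strings like '1,234,567' (digit-group commas) to int."""
    if value is None:
        return None
    return int(str(value).replace(",", ""))
```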

Value format inconsistencies

  1. Runtime values are wildly inconsistent in how they are denoted ('3 h 30 min', '2,002 min', '324 min'); unifying them will likely require parsing via a regex.
  2. Language and Director can be one or more entries, comma-separated; to-do: check if other separators are used.
  3. Rated contains a jumble of different categories, some of which should obviously be unified (like 'Not Rated' and 'Not rated'); some are difficult to discern.
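As an example of the regex parsing mentioned in point 1, a sketch that unifies the observed Runtime formats into minutes (the exact set of formats in the data may be larger):

```python
import re

def runtime_to_minutes(s):
    """Parse runtimes like '3 h 30 min', '2,002 min', '324 min' into minutes."""
    if not isinstance(s, str):
        return None
    s = s.replace(",", "")  # drop digit-group separators, e.g. '2,002 min'
    m = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", s)
    if not m or not (m.group(1) or m.group(2)):
        return None
    return int(m.group(1) or 0) * 60 + int(m.group(2) or 0)
```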

Notes

  1. We see that Movie_Movies contains both movies and series. Given their different nature, it should perhaps be split into two sets.
  2. This list is probably not exhaustive - the dataset is clearly messy, so additional issues are bound to show up downstream.

Given the project scope, this initial data quality assessment provides a passable understanding of the data and key issues. Because data cleanup is not an explicit goal, we can instead keep in mind the identified issues and proceed to the next task, addressing issues as necessary.

We now have a sufficient overview to review project goals and to discuss what value can and cannot be extracted from the data set, and what kind of resource investment it warrants.

Task 2: Top 10 directors

The task is to obtain a list of the 10 directors who produced the most movies, in descending order.

Let's first randomly sample from Director a couple of times, to see what kind of values the column contains.

So movies can be directed by zero, one or many people. This makes sense in terms of the dataset, so there's no need to modify it, but let's make some assumptions with respect to the task:

  1. We don't include movies with no director in the ranking (i.e., NaN is not considered a director).
  2. We consider co-directing as directing: directing a movie alone or with other co-directors is equivalent for the purpose of our rating. (Alternatively, we could consider groups as directors in their own right)
  3. The term "movies" is ambiguous: does it refer to the dataset, or specifically to movies, as opposed to series? We will assume it refers to the dataset, although the second option shouldn't present much additional challenge.
  4. We assume that there are no typos in the Director column. There likely are some, but weeding them out is laborious and the resulting error is negligible: typos are presumably rare and roughly uniformly distributed, so they are unlikely to affect the top of the ranking.
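Under these assumptions, the ranking can be sketched as follows, using a tiny synthetic stand-in for the real data and comma-plus-space as the separator seen in the dataset:

```python
import pandas as pd

# Synthetic stand-in for Movie_Movies
movies = pd.DataFrame({"Director": ["A", "A, B", None, "B", "B"]})

top_directors = (
    movies["Director"]
    .dropna()            # assumption 1: NaN is not a director
    .str.split(", ")     # assumption 2: each co-director counts individually
    .explode()
    .value_counts()      # sorted in descending order
    .head(10)
)
print(top_directors.index.tolist())
```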

Therefore, the list specified in Task 2 is:

['Jim Powers', 'Dave Fleischer', 'D.W. Griffith', 'Lewin Fitzhamon', 'Al Christie', 'Georges Méliès', 'Kevin Dunn', "Gilbert M. 'Broncho Billy' Anderson", 'Quasarman', 'James H. White']

Task 3: Free data mining and exploration

Interactive choropleth graph of productions per country

We can repeat the approach we took with the directors movie count list to see which countries produce the most movies, and visualize this via an interactive choropleth graph.

Note: the choropleth graphs will only show up if this notebook is run locally as a Jupyter notebook, or viewed as exported HTML; GitHub and nbviewer apparently cannot render the interactive choropleth graphs.

We immediately see that most movies are produced in the US; let's quickly plot the top 15 countries:

This looks like a case of Zipf's law - a country's number of productions is inversely proportional to its rank in the movie count table.

To create the choropleth graph, we will need the country names as ISO 3-letter codes. Let's create them from the country names. Note that not all Country entries will be translated: some of the movies were made in countries that no longer exist as such; how to deal with that is a matter of debate, but for now, we'll just skip them (with one notable exception - we'll manually change "UK" to "United Kingdom").

Because the movie counts drop off so steeply, the visualization only really highlights three countries - which accurately reflects the absolute numbers, but offers little additional insight.

Just for kicks, let's transform the absolute number by dividing it by the country's GDP in 2015, standardize it, and then use a choropleth to show movie numbers as a function of recent economic output.

We can obtain country GDP values by querying the World Bank API, and then inner-joining the response DataFrame with our country_movies_count_df.

Standardization (z-score): $$z = \frac{x - \mu}{\sigma}$$
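A minimal sketch of this transformation on an illustrative series of movie-count / GDP ratios:

```python
import pandas as pd

gdp_weighted = pd.Series([0.5, 1.0, 2.0, 4.5])  # illustrative ratios, not real data

# z-score: zero mean, unit standard deviation
z = (gdp_weighted - gdp_weighted.mean()) / gdp_weighted.std()
```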

This map shows the number of movies as a function of recent GDP, standardized and with outliers removed. So far, this is nothing more than a thought experiment, but this could be extended to a sophisticated analysis, linking time-specific GDP to time-specific movie production, and including population in the ranking.

Director movie count - Zipf distribution

Depending on the analysis goals, we might, for example, be interested in the distributions of the data, to get a better feel for the data set.

Looking further down the analytical / predictive pipeline, estimating the distribution functions would be important for Bayesian machine learning approaches.

Continuing where we left off in Task 2, let's plot the number of movies per director.

Here, because we have almost 100k entries, a bar plot makes little sense, and using a line plot is better.

It also looks like a case of Zipf's law - a director's number of productions is inversely proportional to its rank in the movie count table. Let's see how the log-log plot looks:

This looks a lot like Zipf's distribution. If we wanted to investigate further, we could attempt to estimate $\alpha$ using scipy.optimize.curve_fit and see how well the distributions correlate.
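curve_fit would fit the power law directly; as a quick sketch of the same estimate, the exponent can also be recovered with a linear fit in log-log space (the data below is synthetic and exactly Zipf-like, so the fit is perfect):

```python
import numpy as np

ranks = np.arange(1, 1001)
counts = 1000.0 / ranks ** 1.2  # synthetic counts following C * rank^(-alpha)

# log(count) = log(C) - alpha * log(rank): a straight line in log-log space
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
alpha = -slope
```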

imdbRating distribution

Let's have a look at how imdbRating is distributed.

Let's also gather some basic statistics:

We see that the rating mean is 6.41, on a scale from 1 to 10 (so the midpoint rating would be 5.5). This raises some questions:

We could further investigate by, for instance, looking at how votes are distributed along time after a movie's release, include user reviews, introduce weighing functions, and reference additional sources.

Regarding the distribution function of the ratings, hinted at here by the kernel density estimate line - it is not immediately clear what the underlying distribution is.

A quick web search reveals that this may be a Lévy skew distribution, but this would require further investigation.

https://arxiv.org/abs/0806.2305v3